arXiv: 1503.07503v2 [cond-mat.mtrl-sci] 28 Mar 2015 


Accelerated materials property predictions and design using motif-based fingerprints 

Tran Doan Huan, Arun Mannodi-Kanakkithodi, and Rampi Ramprasad

Institute of Materials Science, University of Connecticut,
97 North Eagleville Rd., Unit 3136, Storrs, CT 06269-3136, USA
(Dated: March 31, 2015) 

Data-driven approaches are particularly useful for computational materials discovery and design 
as they can be used for rapidly screening over a very large number of materials, thus suggesting 
lead candidates for further in-depth investigations. A central challenge of such approaches is to 
develop a numerical representation, often referred to as a fingerprint, of the materials. Inspired by 
recent developments in chem-informatics, we propose a class of hierarchical motif-based topological 
fingerprints for materials composed of elements such as C, O, H, N, F, etc., whose coordination 
preferences are well understood. We show that these fingerprints, when representing either molecules 
or crystals, may be effectively mapped onto a variety of properties using a similarity-based learning 
model and hence can be used to predict relevant properties of a material, given that its fingerprint 
can be defined. Two simple procedures are introduced to demonstrate that the learning model can
be inverted to identify the desired fingerprints and then, to reconstruct molecules which possess a 
set of targeted properties. 


I. INTRODUCTION 

Data-driven approaches towards materials design and discovery are rapidly increasing in popularity, demand, and potency. This emerging trend is fueled by the availability and emergence of large materials databases, as well as our ability to progressively accumulate materials data via high-throughput computation and experiments. Data-driven strategies aimed at rapid property predictions, and ultimately at rational or informed materials design, rely on exploiting the information content of past data, and on using such information within heuristic or statistical interpolative learning models to provide estimates of the properties of a new material. This approach is entirely analogous to similar pursuits undertaken within chem- and bio-informatics, wherein lead candidates worthy of further in-depth investigations are identified rapidly in a first level of screening.

Data-driven property prediction strategies have two steps. The first involves representing materials numerically via descriptors, attribute vectors, or fingerprints. In the second step, using available "training" data sets, a mapping is established between the numerical representation of materials and their properties, thus leading to a prediction model. Subsequently, the properties of a new material are estimated using this model after reducing the material to its numerical representation.

One of the central challenges in this whole process is deciding on an appropriate and acceptable numerical representation of materials. The specific choice of this representation is entirely application dependent, and can range from high-level descriptors (e.g., d-band center, atomic electronegativities) to topological features (e.g., substructural motifs) to microscopic fingerprints that may capture chemical and configurational degrees of freedom (e.g., Coulomb matrix, symmetry functions). Regardless of the specific choice, the representations are expected to satisfy certain basic requirements. These include invariance of the representation with respect to transformations of the material such as translation, rotation, and permutation of like elements. Moreover, it is desired that the representation be intuitive, elegant, and physically and chemically meaningful.

In this contribution, inspired by developments in chem-informatics, we propose a class of hierarchical motif-based topological fingerprints. This choice, in which the motifs are molecular fragments of varying sizes, is particularly suited to representing molecules and solids composed of elements such as H, C, N, O, F, etc., whose coordination preferences are well understood. Large datasets of molecules and solids are considered, and it is shown that the fingerprints may be effectively mapped to a variety of properties using a similarity-based learning algorithm. Moreover, it is demonstrated that the learning model may be inverted to identify fingerprints and, subsequently, to reconstruct actual molecules that possess a desired set of target properties.

II. DATASETS 

In the present work, we restrict ourselves to systems composed of C, O, and H. We used two datasets, one for molecules and one for crystals, to demonstrate the applicability of the proposed fingerprints. Of these two datasets, the former was taken from Ref. [19] while the latter was prepared by us.

A. Molecule dataset 

A dataset of more than 134,000 small molecules made up of C, O, H, N, and F was reported in Ref. [19]. This reliable dataset, which contains the optimized geometries and the energetic, electronic, and thermodynamic properties calculated using the B3LYP hybrid exchange-correlation (XC) functional and the 6-31G(2df,p) basis set with the Gaussian 09 software, sets the stage for many interesting data-mining works. A subset of this dataset, containing 45,708 molecules composed of C, O, and H, was used in this work. Five properties were considered, including the atomization energy $E_{\rm at}$, the energy gap $E_{\rm HL}$ between the highest occupied and lowest unoccupied molecular orbitals (HOMO-LUMO gap), the isotropic polarizability $\alpha$, the heat capacity $C_v$, and the zero-point vibration energy $E_{\rm zp}$.


B. Crystal dataset 

In addition to the molecules dataset, we prepared another dataset of 215 organic crystals composed of C, O, and H. This includes

1. 12 existing polymers composed of C, O, and H,

2. 16 new polymer structures predicted by the minima-hopping method and USPEX for 16 quasi-one-dimensional polymer chain models reported in Ref. [3],

3. 34 organic crystals composed of C and H and 153 organic crystals composed of C, O, and H obtained from the Crystallography Open Database.

The obtained structures were optimized by first-principles calculations within the DFT formalism as implemented in the Vienna Ab initio Simulation Package (VASP), utilizing the semi-local rPW86 XC functional and a plane-wave energy cutoff of 400 eV. A Monkhorst-Pack k-point mesh with a spacing of no more than 0.15 Å$^{-1}$ in reciprocal space was used for sampling the Brillouin zone, while the van der Waals interactions were estimated with the non-local density functional vdW-DF2. Convergence was assumed when the forces acting on the atomic sites were smaller than 0.01 eV/Å. The entire crystals dataset, which includes the optimized structures, the atomization energies $E_{\rm at}$, the band gaps $E_g$, and the electronic and ionic parts of the dielectric constants, $\epsilon_{\rm elec}$ and $\epsilon_{\rm ion}$, can be found in the Supplemental Material.


III. FINGERPRINTS 

A hierarchy of equilibrium-structure fingerprints of the same family, with increasing levels of sophistication, is proposed here. The construction of the fingerprints was guided by two simple chemical concepts, i.e., chemical bonds and coordination number. The former intuitively characterizes the short-range interatomic interactions while the latter is the number of bonds involving a given atom. In major classes of materials composed of light elements like C, H, O, N, and F, these concepts are well defined. In particular, the length of a given bond involving these elements falls in a narrow range (see, e.g., Ref. [44] for comprehensive bond-length statistics). For instance, the equilibrium length of a single bond between two C atoms is about 1.50 Å, the length of a double bond between two C atoms is about 1.34 Å, and the length of a double bond between a C atom and an O atom is about 1.20 Å. The coordination number is also well defined, i.e., for a C atom, it can only be 2, 3, or 4 while each O atom can generally bond with 1 or 2 other atoms.

[Figure 1 panels. Top row: C2, C3, C4, O1, O2, H1. Middle row: C2-C3, C3-C4, O2-C4. Bottom row: C2-C3-C4, O2-C4-C4, H1-O2-C4.]

FIG. 1. (Color online) Illustration of the atom types ($A_i$, top row), some of the bond types ($A_i$-$B_j$, middle row), and two-bond catenations ($A_i$-$B_j$-$C_k$, bottom row) of materials composed of carbon, oxygen, and hydrogen.

Therefore, atoms in a structure can be unambiguously classified (or labeled) by $A_i$, where $A$ is the type of the element ($A \in \{{\rm C, O, H}\}$) and $i$ is its coordination number. Likewise, bonds can be specified by the types of their two ends, e.g., $A_i$-$B_j$. For the datasets of C, O, and H, the six possible atom types are C2, C3, C4, O1, O2, and H1 while there are sixteen chemically permissible types of bonds, namely C2-C2, C2-C3, C2-C4, C2-O1, C2-O2, C2-H1, C3-C3, C3-C4, C3-O1, C3-O2, C3-H1, C4-C4, C4-O2, C4-H1, O2-O2, and O2-H1. Except for C2-O1, C2-O2, and O2-O2, thirteen of them are present in our molecules and crystals datasets. The atom and bond types belong to a family of related structural building units (subsequently described) that can be used to numerically represent the materials structures and hence are used to define the fingerprints. In particular, the $n$th-order fingerprint $\mathbf{f}^{(n)}$ is defined in terms of its components $f^{(n)}_k$ as





$$ f^{(n)}_k = \frac{n_k}{N_{\rm at}}. \qquad (1) $$

Here, $n_k$ is the number of building units (or fragments or motifs) of type $k$ and $N_{\rm at}$ is the number of atoms either in the molecule or in the unit cell of a crystal. Four types of fingerprints, namely $\mathbf{f}^{(0)}$, $\mathbf{f}^{(1)}$, $\mathbf{f}^{(2)}$, and $\mathbf{f}^{(3)}$, are discussed in the following subsections.

A. 0th-order fingerprint, $\mathbf{f}^{(0)}$


The simplest (0th-order) fingerprint $\mathbf{f}^{(0)}$ represents the fractions of all the element types $A$ existing in the structures, i.e., $k = A$. Therefore, in the definition (1) of $\mathbf{f}^{(0)}$, $n_A$ is the number of atoms of element $A$. This fingerprint is a three-dimensional vector whose components satisfy a simple normalization condition $\sum_{A \in \{{\rm C, O, H}\}} f^{(0)}_A = 1$.
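As a minimal sketch of this definition, the snippet below computes the element fractions for a molecule given as a flat list of element symbols. The function name and the ethanol example are illustrative assumptions, not part of the original work; exact fractions are used so that the normalization condition can be checked without rounding.

```python
from collections import Counter
from fractions import Fraction


def zeroth_order_fingerprint(elements):
    """Return f^(0) as {element: n_A / N_at} for a list of element symbols."""
    n_at = len(elements)
    return {el: Fraction(n, n_at) for el, n in Counter(elements).items()}


# Ethanol (C2H6O) written as a flat list of element symbols (illustrative input).
ethanol = ["C", "C", "O"] + ["H"] * 6
f0 = zeroth_order_fingerprint(ethanol)
print(f0)
```

Because only element counts enter, all isomers of a given composition share the same 0th-order fingerprint.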


B. 1st-order fingerprint, $\mathbf{f}^{(1)}$

Next in the hierarchy is the case $k = A_i$, in which $n_{A_i}$ is the number of $A$ atoms that are $i$-fold coordinated. $\mathbf{f}^{(1)}$ is a 6-dimensional vector satisfying several constraints established from the definition or from the chemistry. The first one is the normalization condition, given as

$$ \sum_{A_i} f^{(1)}_{A_i} = 1. \qquad (2) $$

Within the two datasets, all the C2 atoms should be grouped in pairs, forming triple C≡C bonds. Therefore, the number of C2 atoms, which is $N_{\rm at} \times f^{(1)}_{\rm C2}$, must be an even integer. Moreover, since each C3 atom makes a double bond only with either an O1 atom or another C3 atom, one must have $f^{(1)}_{\rm C3} \geq f^{(1)}_{\rm O1}$ while $N_{\rm at} \times (f^{(1)}_{\rm C3} - f^{(1)}_{\rm O1})$ is an even number. By examining the connectivity of a structure, another constraint reads

$$ f^{(1)}_{\rm H1} - 2 f^{(1)}_{\rm C4} - f^{(1)}_{\rm C3} + f^{(1)}_{\rm O1} = \frac{2}{N_{\rm at}} (1 - N_Q - d), \qquad (3) $$

where $N_Q$ is the number of closed loops of bonds and $d$ is a structure-dependent parameter. For molecules and crystals composed of isolated substructures (or molecules), $d = 0$, while for crystals composed of connected substructures, $d > 0$. The derivation of this constraint is given in Appendix A. The last constraint of $\mathbf{f}^{(1)}$ is written in the form of a recursion relation, i.e.,

$$ f^{(0)}_A = \sum_i f^{(1)}_{A_i}. \qquad (4) $$
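The constraints on the 1st-order fingerprint can be checked mechanically. The sketch below labels atoms by element and coordination number from a bond list, then tests the normalization, the pairing rules for C2 and C3 atoms, and the connectivity constraint with the loop count and $d$ supplied by the caller. All function names and the acetic acid example are hypothetical, and $d$ is assumed to be an integer here so that exact fractions can be used.

```python
from collections import Counter
from fractions import Fraction


def first_order_counts(elements, bonds):
    """Count atom types A_i, where i is the number of bonded neighbours."""
    coordination = [0] * len(elements)
    for a, b in bonds:
        coordination[a] += 1
        coordination[b] += 1
    return Counter(f"{el}{cn}" for el, cn in zip(elements, coordination))


def satisfies_constraints(counts, n_loops=0, d=0):
    """Check normalization, the pairing rules, and the connectivity constraint."""
    n_at = sum(counts.values())
    f = {label: Fraction(n, n_at) for label, n in counts.items()}
    normalized = sum(f.values()) == 1
    c2_paired = counts.get("C2", 0) % 2 == 0            # C2 atoms come in pairs
    excess_c3 = counts.get("C3", 0) - counts.get("O1", 0)
    c3_paired = excess_c3 >= 0 and excess_c3 % 2 == 0   # leftover C3 atoms pair up
    lhs = f.get("H1", 0) - 2 * f.get("C4", 0) - f.get("C3", 0) + f.get("O1", 0)
    rhs = Fraction(2 * (1 - n_loops - d), n_at)
    return normalized and c2_paired and c3_paired and lhs == rhs


# Acetic acid, CH3-COOH: atoms 0-1 are C, 2-3 are O, 4-7 are H (illustrative).
elements = ["C", "C", "O", "O", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (1, 3), (3, 7), (0, 4), (0, 5), (0, 6)]
counts = first_order_counts(elements, bonds)
print(counts, satisfies_constraints(counts))
```

Only the neighbour count matters for the labels, so bond orders do not need to be stored in the bond list.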


C. 2nd-order fingerprint, $\mathbf{f}^{(2)}$

Both $\mathbf{f}^{(0)}$ and $\mathbf{f}^{(1)}$ are local, representing the density of the atom types of a material. The equilibrium interatomic distance is to some extent captured by the 2nd-order fingerprint $\mathbf{f}^{(2)}$, in which all the possible bonds are counted. $\mathbf{f}^{(2)}$ is a 13-dimensional vector whose components, $f^{(2)}_{A_i-B_j}$, represent the normalized number of the $A_i$-$B_j$ bonds in the structure. From $\mathbf{f}^{(2)}$, $\mathbf{f}^{(1)}$ can readily be determined by a recursion relation

$$ f^{(1)}_{A_i} = \frac{1}{i} \sum_{B_j} (1 + \delta_{A_iB_j})\, f^{(2)}_{A_i-B_j}, \qquad (5) $$

where the Kronecker delta $\delta_{A_iB_j}$ handles the double counting that arises when $A_i = B_j$ [see Appendix B for the derivation of (5)]. Through this recursion relation, all the constraints that $\mathbf{f}^{(1)}$ obeys are applicable for $\mathbf{f}^{(2)}$. We note that $\mathbf{f}^{(2)}$ was discussed in several previous works, e.g., in Refs. [25] and [47], under the name of "bond counting". This fingerprint can also be regarded as a generalization of "doubles", the fingerprint defined in Ref. [20] for the chain models of polymers.
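The relation between bond counts and atom-type counts can be illustrated with a short sketch. It assumes that each bond is counted once, that a bond between two atoms of the same type contributes two bond ends of that type, and that labels consist of a one-letter element symbol followed by the coordination number. The function names and the propane example are hypothetical.

```python
from collections import Counter


def atom_labels(elements, bonds):
    """Label each atom as element + coordination number, e.g. 'C4'."""
    coordination = [0] * len(elements)
    for a, b in bonds:
        coordination[a] += 1
        coordination[b] += 1
    return [f"{el}{cn}" for el, cn in zip(elements, coordination)]


def bond_counts(elements, bonds):
    """Count bonds by type; keys are sorted so that A-B and B-A coincide."""
    labels = atom_labels(elements, bonds)
    return Counter(tuple(sorted((labels[a], labels[b]))) for a, b in bonds)


def atoms_from_bonds(n_bonds):
    """Recover n_{A_i} from bond counts by dividing the bond ends by i."""
    ends = Counter()
    for (a, b), n in n_bonds.items():
        ends[a] += n  # one bond end sits on an A_i atom ...
        ends[b] += n  # ... the other on a B_j atom (same key when A_i = B_j)
    return {label: n // int(label[1:]) for label, n in ends.items()}


# Propane, CH3-CH2-CH3: atoms 0-2 are C, atoms 3-10 are H (illustrative).
elements = ["C"] * 3 + ["H"] * 8
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5),
         (1, 6), (1, 7), (2, 8), (2, 9), (2, 10)]
n_bonds = bond_counts(elements, bonds)
f2 = {bond: n / len(elements) for bond, n in n_bonds.items()}
print(n_bonds, atoms_from_bonds(n_bonds))
```

For propane the two C4-C4 bonds supply four bond ends and the eight C4-H1 bonds supply eight more, giving 12/4 = 3 carbon atoms, as expected.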


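D. 3rd-order fingerprint, $\mathbf{f}^{(3)}$

In the 3rd-order fingerprint $\mathbf{f}^{(3)}$, the number of two-bond catenations is represented, i.e., $k = A_i$-$B_j$-$C_k$. In particular, the definition (1) for $f^{(3)}_{A_i-B_j-C_k}$ involves $n_{A_i-B_j-C_k}$, which is the number of $A_i$-$B_j$-$C_k$ sequences, or equivalently, the catenations of two bonds $A_i$-$B_j$ and $B_j$-$C_k$. Considering compounds of C, O, and H, there are 125 possible distinct catenations of two bonds $A_i$-$B_j$ and $B_j$-$C_k$. From $\mathbf{f}^{(3)}$, $\mathbf{f}^{(2)}$ can be determined as (see Appendix B)

$$ f^{(2)}_{A_i-B_j} = \frac{1}{j-1} \sum_{C_k} (1 + \delta_{A_iC_k})\, f^{(3)}_{A_i-B_j-C_k} = \frac{1}{i-1} \sum_{C_k} (1 + \delta_{B_jC_k})\, f^{(3)}_{B_j-A_i-C_k}, \qquad (6) $$

written here for $A_i \neq B_j$ and $i, j > 1$. Similar to $\mathbf{f}^{(2)}$, $\mathbf{f}^{(3)}$ can be viewed as a generalization of "triples", the fingerprint examined in Ref. [20].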
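A minimal sketch of how two-bond catenations might be counted is given below: every pair of distinct neighbours of a central atom defines one sequence, and the two end labels are sorted so that a sequence and its reverse are counted as one. The helper that recovers a bond count from sequences centred on $B_j$ assumes $A_i \neq B_j$ and weights sequences with two identical ends twice. Names and the propane example are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def atom_labels(elements, bonds):
    """Label each atom as element + coordination number, e.g. 'C4'."""
    coordination = [0] * len(elements)
    for a, b in bonds:
        coordination[a] += 1
        coordination[b] += 1
    return [f"{el}{cn}" for el, cn in zip(elements, coordination)]


def catenation_counts(elements, bonds):
    """Count sequences (end, centre, end); the two end labels are sorted."""
    labels = atom_labels(elements, bonds)
    neighbours = [[] for _ in elements]
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    counts = Counter()
    for centre, nbrs in enumerate(neighbours):
        for a, c in combinations(nbrs, 2):
            end_1, end_2 = sorted((labels[a], labels[c]))
            counts[(end_1, labels[centre], end_2)] += 1
    return counts


def bonds_from_catenations(n_cat, a_label, b_label):
    """Number of A-B bonds (A != B) from sequences centred on B, with j > 1."""
    j = int(b_label[1:])
    total = 0
    for (end_1, centre, end_2), n in n_cat.items():
        if centre == b_label and a_label in (end_1, end_2):
            total += 2 * n if end_1 == end_2 else n  # A-B-A holds two A-B bonds
    return total // (j - 1)


# Propane, CH3-CH2-CH3: atoms 0-2 are C, atoms 3-10 are H (illustrative).
elements = ["C"] * 3 + ["H"] * 8
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5),
         (1, 6), (1, 7), (2, 8), (2, 9), (2, 10)]
n_cat = catenation_counts(elements, bonds)
print(n_cat, bonds_from_catenations(n_cat, "H1", "C4"))
```

In propane there are 7 H1-C4-H1 and 10 C4-C4-H1 sequences, so the C4-H1 bond count is (2 x 7 + 10) / 3 = 8.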


IV. PROPERTY PREDICTION MODEL 


A learning model is critical in order to map the fingerprints to properties. In this work, we chose Gaussian kernel ridge regression (KRR), a technique which has successfully been used in materials property prediction. Within this model, the input fingerprints are transformed into a higher-dimensional space whereby a linear relation between the transformed fingerprints and the associated properties can be established. This mapping involves the distances between fingerprints and can be regarded as a similarity-based prediction model, i.e., similar properties may be predicted for materials with similar fingerprints.

FIG. 2. (Color online) Learning curves corresponding to $E_{\rm at}$, $E_{\rm zp}$, $C_v$, and $E_{\rm HL}$. For each model, $\mathbf{f}^{(0)}$, $\mathbf{f}^{(1)}$, $\mathbf{f}^{(2)}$, and $\mathbf{f}^{(3)}$ are used to represent the molecules. Calculated data are given by symbols while curves are guides for the eye.

In the KRR model, the property $P_\mu$ of a structure $\mu$ is predicted as a weighted sum of Gaussians,

$$ P_\mu = \sum_\nu w_\nu \exp\left(-\frac{d_{\mu\nu}^2}{2\sigma^2}\right), \qquad (7) $$

where $\nu$ runs over all the fingerprints in the training dataset. Here, $d_{\mu\nu}$ is the distance between fingerprints $\mu$ and $\nu$, defined by the Euclidean metric $d_{\mu\nu} = \sqrt{\sum_k (f^\mu_k - f^\nu_k)^2}$. The Gaussian width parameter $\sigma$ and the regression coefficients $w_\nu$ are determined within the training phase, in which a regularized objective function is minimized. In the training phase, $\sigma$ and the regularization parameter $\lambda$ are determined by $K$-fold cross validation on the training set ($K = 5$ in this work). Within this method, the training dataset is split into $K$ bins; each of the bins is in turn considered to be a new test dataset while the remaining $K - 1$ bins form a new training dataset. This procedure is repeated for each of the $K$ bins and for every value of $\sigma$ and $\lambda$ on a preselected logarithmic-scale grid. The optimal values of $\sigma$ and $\lambda$, i.e., those leading to the minimum $K$-fold cross-validation (mean absolute) error, are used to compute the regression coefficients $w_\nu$ for the entire dataset.
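The prediction step can be sketched in a few lines of dependency-free Python. The sketch assumes the common closed-form ridge solution, in which the weights solve (K + lambda I) w = P with K the Gaussian kernel matrix, and the exp(-d^2 / (2 sigma^2)) kernel convention; neither detail is spelled out in the text, so both are assumptions, as are all function names and the toy data.

```python
import math


def gaussian_kernel(f_mu, f_nu, sigma):
    """exp(-d^2 / (2 sigma^2)), with d the Euclidean distance of two fingerprints."""
    d2 = sum((a - b) ** 2 for a, b in zip(f_mu, f_nu))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def krr_fit(fingerprints, properties, sigma, lam):
    """Regression weights from (K + lam * I) w = P on the training set."""
    n = len(fingerprints)
    kernel = [[gaussian_kernel(fingerprints[i], fingerprints[j], sigma)
               + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return solve(kernel, properties)


def krr_predict(f_new, fingerprints, weights, sigma):
    """Weighted sum of Gaussians centred on the training fingerprints."""
    return sum(w * gaussian_kernel(f_new, f, sigma)
               for w, f in zip(weights, fingerprints))


# Toy data (made up for illustration): three 2-component fingerprints.
train_f = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
train_p = [1.0, 2.0, 3.0]
weights = krr_fit(train_f, train_p, sigma=1.0, lam=1e-10)
print(krr_predict([0.5, 0.5], train_f, weights, sigma=1.0))
```

In a cross-validation loop, `krr_fit` would be called on the retained bins for each grid point of the width and the regularization parameter, and `krr_predict` would be scored on the held-out bin.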


V. PROPERTY PREDICTION RESULTS

A. Molecules dataset

The four fingerprints considered, namely $\mathbf{f}^{(0)}$, $\mathbf{f}^{(1)}$, $\mathbf{f}^{(2)}$, and $\mathbf{f}^{(3)}$, were used to represent the molecules dataset. To mimic the learning and prediction processes, the dataset was randomly partitioned into a training dataset and a test dataset. The KRR model was then trained on the training dataset using five-fold cross validation before predictions were made on the test dataset. We show in Fig. 2 the learning curves of $E_{\rm at}$, $E_{\rm zp}$, $\alpha$, and $E_{\rm HL}$, plotting the training and test errors against the number of molecules in the training dataset (data reported in this figure were averaged over 30 independent runs). In addition, predictions for the test dataset of 44,708 molecules after training the KRR model on a dataset of 1,000 molecules are shown in Fig. 3. As discussed in detail below, both Fig. 2 and Fig. 3 indicate that all of these properties can be predicted very well by using the higher-order fingerprints, provided that the KRR model is trained on a training dataset of 200 or more data points.

The general tendency, as revealed by Fig. 2, is that higher-order fingerprints offer more accurate predictions. The 0th-order fingerprint $\mathbf{f}^{(0)}$ can be used to roughly estimate energy-related quantities, i.e., $E_{\rm at}$ and $E_{\rm zp}$, while it cannot be used for others. For instance, $E_{\rm HL}$ cannot be predicted with $\mathbf{f}^{(0)}$ because this fingerprint is totally local in nature, encoding no information at any finite range. Consequently, the finite conjugation length, known to signal the energy gap reduction in complex (conjugated) systems (see, for example, Ref. [50]), is not captured by $\mathbf{f}^{(0)}$. Fingerprints of higher orders contain some information at increasing ranges, allowing for systematically better predictions of $E_{\rm HL}$. These fingerprints also work sufficiently well in predicting $E_{\rm at}$ and $E_{\rm zp}$. The averaged error in predicting $E_{\rm at}$ decreases from 25 meV/atom to 20 meV/atom and 18 meV/atom as the order of the fingerprint is raised. The very good power of $\mathbf{f}^{(2)}$ in predicting $E_{\rm at}$ reproduces the similar conclusions drawn for the "bond counting" fingerprint in Ref. [47]. This behavior is understandable because the dissociation energies of chemical bonds in organic molecules and crystals, which dominate the stability of these systems, are well defined in the same fashion as the bond lengths previously discussed. Interestingly, this predictive power can be improved significantly if more advanced fingerprints, i.e., those that can capture the small perturbations of interatomic distances like the Coulomb matrix, are used. The higher-order fingerprints are also significantly better than the lower-order ones in predicting $C_v$. The considerable improvement in the predictions of $\alpha$ obtained with a higher-order fingerprint may indicate the key contribution from polar bonds to the high-value regime of $\alpha$.


FIG. 3. (Color online) Predictions for $E_{\rm at}$, $E_{\rm zp}$, $\alpha$, $C_v$, and $E_{\rm HL}$ of the molecules dataset, using $\mathbf{f}^{(0)}$, $\mathbf{f}^{(1)}$, $\mathbf{f}^{(2)}$, and $\mathbf{f}^{(3)}$ (from top row to bottom row). For each prediction, the training dataset consists of 1,000 points while the test dataset includes the remaining 44,708 data points.

B. Crystals dataset

We performed similar predictions for the dataset of 215 crystals containing C, O, and H. Using the KRR model coupled with $\mathbf{f}^{(0)}$, $\mathbf{f}^{(1)}$, $\mathbf{f}^{(2)}$, and $\mathbf{f}^{(3)}$, five properties of these crystals, including the atomization energy $E_{\rm at}$, the band gap $E_g$, the electronic dielectric constant $\epsilon_{\rm elec}$, the ionic dielectric constant $\epsilon_{\rm ion}$, and the total dielectric constant $\epsilon_{\rm tot} = \epsilon_{\rm elec} + \epsilon_{\rm ion}$, were predicted. We show in Fig. 4 the learning curves, representing the errors of the predictions using these fingerprints, averaged over 100 independent runs. In Fig. 5, the predictions for the five properties are given, using the KRR model trained on a random training set of 150 data points.

Clearly, the tendency of the prediction performances on the crystals dataset is similar to that of the molecules dataset, i.e., high accuracies are obtained with fingerprints of higher orders, and properties which are governed by long-range information, e.g., the band gap $E_g$, can only be predicted with high-order fingerprints. For the atomization energy, predictions with $\mathbf{f}^{(0)}$ and $\mathbf{f}^{(1)}$ lead to quite high averaged errors, which are reduced to 18 meV/atom and 15 meV/atom when $\mathbf{f}^{(2)}$ and $\mathbf{f}^{(3)}$, respectively, are used. Overall, all five examined properties can be predicted well when high-order fingerprints are used to represent the crystals. For instance, with a high-order fingerprint, the averaged error in predicting $E_g$ is about 0.45 eV while the electronic dielectric constant $\epsilon_{\rm elec}$ and the ionic dielectric constant $\epsilon_{\rm ion}$ can be predicted with an averaged error of 0.1-0.2.


VI. UTILITIES OF THE FINGERPRINTS 

The demonstrated predictive power of the KRR model, which uses the proposed fingerprints to represent materials structures, inspires the idea of using this model to rationally optimize materials for a targeted property, the concept often referred to as "inverse design". In fact, a large number of success stories along this direction have been reported in the past, using various approaches, e.g., iteratively optimizing the properties of a given compound or on-the-fly screening when searching for stable structures. Here, our idea is that, starting from a trained KRR model, fingerprints which correspond to the desired properties can be predicted. Then, molecular structures will be reconstructed from the predicted fingerprints. Finally, the targeted properties will be verified by DFT calculations at the same level as those used for the training dataset.

The greatest challenge of this procedure is to ensure that the predicted fingerprint is physically and chemically meaningful, i.e., at least one material structure can be reconstructed from it. Therefore, one must mathematically define the subspace of the meaningful fingerprints, and then limit the search for desired fingerprints within this subspace. We present two approaches which can be used for designing molecules (the design of crystals is not considered here).

FIG. 4. (Color online) Learning curves corresponding to $E_{\rm at}$, $\epsilon_{\rm elec}$, $\epsilon_{\rm ion}$, $\epsilon$, and $E_g$ determined by using $\mathbf{f}^{(0)}$, $\mathbf{f}^{(1)}$, $\mathbf{f}^{(2)}$, and $\mathbf{f}^{(3)}$ to represent the crystal structures. Calculated data are shown by symbols while curves are guides for the eye.


A. Design via enumeration 

The central idea of this approach is that the components of a given fingerprint can be enumerated in such a way that the fingerprint is meaningful. We used $\mathbf{f}^{(2)}$ for a demonstration because predictions using this fingerprint are good while its dimensionality is not as high as that of $\mathbf{f}^{(3)}$. We first implemented the applicable rules involving bonds and coordination numbers by defining five "backbone" blocks. They include C4, C=C (a pair of C3 atoms with a double bond), C≡C (a pair of C2 atoms with a triple bond), C=O (one C3 and one O1 atom linked by a double bond), and O2. By definition, all of the dangling bonds starting from these blocks are single, thus any of them can be connected to others without any constraint. Then, given a set of backbone blocks, all the possible arrangements can be scanned, keeping track of the connectivity to eliminate some dangling bonds, and saturating the remaining dangling bonds by either H1 or OH, referred to as "ending" blocks. From the obtained arrangements, $\mathbf{f}^{(2)}$ can be unambiguously determined and their properties were predicted. Those with targeted properties were singled out to rebuild molecular structures for validating calculations. We show in Fig. 6 two optimized molecules constructed from two of the predicted fingerprints, labeled by A and B, accompanied by the predicted and calculated $E_{\rm HL}$ and $\alpha$. The results given in Fig. 6 indicate that the desired molecules are indeed obtained.
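The bookkeeping behind the backbone blocks can be sketched at the composition level. The snippet below is a deliberate simplification rather than the arrangement scan described above: it enumerates multisets of backbone blocks, assumes the blocks are joined into a loop-free molecule, saturates every remaining dangling bond with H1 (an OH ending is represented by a terminal O2 block), and returns atom-type fractions instead of bond-type fractions. The block table, the `C#C` label for the triple-bonded pair, and the function name are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Backbone blocks: (atom types contributed, number of dangling single bonds).
BLOCKS = {
    "C4": ({"C4": 1}, 4),
    "C=C": ({"C3": 2}, 4),
    "C#C": ({"C2": 2}, 2),
    "C=O": ({"C3": 1, "O1": 1}, 2),
    "O2": ({"O2": 1}, 2),
}


def enumerate_compositions(max_blocks):
    """Yield (block multiplicities, atom-type fractions) for loop-free molecules."""
    names = list(BLOCKS)
    for multiplicities in product(range(max_blocks + 1), repeat=len(names)):
        n_units = sum(multiplicities)
        if not 1 <= n_units <= max_blocks:
            continue
        counts, dangling = {}, 0
        for name, m in zip(names, multiplicities):
            atoms, free_bonds = BLOCKS[name]
            dangling += m * free_bonds
            for label, n in atoms.items():
                counts[label] = counts.get(label, 0) + m * n
        # Linking n_units blocks without loops consumes 2 dangling bonds per link.
        counts["H1"] = dangling - 2 * (n_units - 1)
        counts = {label: n for label, n in counts.items() if n > 0}
        n_at = sum(counts.values())
        fractions = {label: Fraction(n, n_at) for label, n in counts.items()}
        yield dict(zip(names, multiplicities)), fractions


results = list(enumerate_compositions(2))
print(len(results))
```

Each composition found this way would still need its possible arrangements scanned before bond-level fingerprints and properties can be assigned.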


B. Design via inversion 


Different from the enumeration approach, this procedure aims to directly determine the fingerprints, starting from the desired properties. This goal can be achieved by optimizing an objective function, aiming towards the desired properties while applying the constraints that ensure the fingerprints considered are meaningful. Because the reconstruction step requires a simple enough fingerprint, $\mathbf{f}^{(1)}$ was selected for this approach. Among the constraints established for $\mathbf{f}^{(1)}$, two are explicitly imposed in the objective function defined below


G[f(i),Ai,A2] = (P - Popt)'+ Ai 

+ A 2 


f(l) 

Jm 


2 


E/A’- 

L Ai 

f(l) , f(l) 
JC 3 ^01 


( 8 ) 


Here, $\lambda_1$ and $\lambda_2$ are the Lagrange multipliers associated with the constraints while $P$ is the property (or properties) of the trial fingerprint $\mathbf{f}^{(1)}$ predicted by the trained KRR model. In practice, we evaluated $P$ by averaging many predictions, each of which was given by the KRR model trained on a randomly selected training dataset of 1,000 data points. All the terms in (8) are given in the quadratic form to smoothen $G$. Generally, the problem of minimizing $G[\mathbf{f}^{(1)}, \lambda_1, \lambda_2]$ (performed with simulated annealing in this work) returns many solutions $\mathbf{f}^{(1)}$. For each of them, $N_{\rm at}$ was determined by minimizing another objective function $D[\mathbf{f}^{(1)}]$ defined as

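$$ D[\mathbf{f}^{(1)}] = \sum_{A_i} \left[ N_{\rm at} f^{(1)}_{A_i} - {\rm nint}\left( N_{\rm at} f^{(1)}_{A_i} \right) \right]^2, \qquad (9) $$

where nint($x$) returns the closest integer to $x$. Once $N_{\rm at}$ is determined, a post-screening step is performed to consider the possibility of $N_Q > 0$ and to single out the fingerprints for which $N_{\rm at} f^{(1)}_{\rm C2}$ and $N_{\rm at} (f^{(1)}_{\rm C3} - f^{(1)}_{\rm O1})$ are non-negative even numbers. Such fingerprints are meaningful, i.e., molecules can be built up from any of them.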
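The step that assigns an atom count to a predicted fingerprint can be sketched as a simple scan. The snippet assumes that the mismatch function is a sum of squared deviations from the nearest integers, that the smallest atom count within a tolerance of the best mismatch is preferred, and that the search is capped at `n_max`; these choices, the function names, and the noisy ethanol-like fingerprint are illustrative assumptions.

```python
import math


def nint(x):
    """Closest integer to x (halves are rounded up)."""
    return math.floor(x + 0.5)


def integer_mismatch(f1, n_at):
    """Sum of squared deviations of n_at * f from the nearest integers."""
    return sum((n_at * f - nint(n_at * f)) ** 2 for f in f1.values())


def find_n_at(f1, n_max=50, tol=1e-6):
    """Smallest atom count whose mismatch is within tol of the best one found."""
    scores = {n: integer_mismatch(f1, n) for n in range(1, n_max + 1)}
    best = min(scores.values())
    return min(n for n, score in scores.items() if score <= best + tol)


def passes_post_screening(f1, n_at):
    """C2 atoms and the C3 atoms not bonded to O1 must both come in pairs."""
    n = {label: nint(n_at * f) for label, f in f1.items()}
    excess_c3 = n.get("C3", 0) - n.get("O1", 0)
    return n.get("C2", 0) % 2 == 0 and excess_c3 >= 0 and excess_c3 % 2 == 0


# A slightly noisy fingerprint close to that of ethanol (illustrative input).
f1 = {"C4": 0.2222, "O2": 0.1111, "H1": 0.6667}
n_at = find_n_at(f1)
print(n_at, passes_post_screening(f1, n_at))
```

Multiples of the true atom count give similarly small mismatches, which is why the sketch returns the smallest acceptable value.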

We demonstrate this procedure by optimizing two properties simultaneously, i.e., $E_{\rm HL}$ and $\alpha$. We note that these properties seem to be competing, as shown in Fig. 7, where an asymptotic limit of the form $\alpha \sim 1/E_{\rm HL}$ can be seen (a similar limit between two related properties of crystals, namely $\epsilon_{\rm elec}$ and $E_g$, was documented in earlier work). An examination of Fig. 3 reveals that the prediction of $\alpha$ using $\mathbf{f}^{(1)}$ is fairly good in the region of $\alpha < 0.8$ Å$^3$/atom. For this reason, we searched for new molecules, i.e., those that do not exist in the molecules dataset, for which $0.6 < \alpha < 0.7$ Å$^3$/atom while $E_{\rm HL} > 7$ eV, and show the results in Fig. 7. While the calculated $E_{\rm HL}$ of the molecules dataset can reach the upper limit of 10 eV, all the predictions for $E_{\rm HL}$ by the KRR model are below 9 eV. The reason is given in Fig. 3, which clearly implies that when $\mathbf{f}^{(1)}$ is coupled with the KRR model, high values of $E_{\rm HL}$ ($8 < E_{\rm HL} < 10$ eV) are generally underestimated by roughly 1 eV.

Three of the predicted fingerprints, labeled by C, D, and E, were selected for rebuilding new molecules. From either C or E, only one molecule can be constructed while many different molecules correspond to D. All of the molecules reconstructed from C, D, and E were optimized and then their $\alpha$ and $E_{\rm HL}$ were calculated with Gaussian 09, using the 6-31G(2df,p) basis set and the B3LYP XC functional. The results are summarized in Table I and in the inset of Fig. 7, demonstrating that the molecules with desired values of $\alpha$ and $E_{\rm HL}$ were actually obtained. Detailed information on all of the designed molecules can be found in the Supplemental Material.

FIG. 5. (Color online) Predictions for $E_{\rm at}$, $\epsilon_{\rm elec}$, $\epsilon_{\rm ion}$, $\epsilon$, and $E_g$ of the crystals dataset, using $\mathbf{f}^{(0)}$, $\mathbf{f}^{(1)}$, $\mathbf{f}^{(2)}$, and $\mathbf{f}^{(3)}$ (from top row to bottom row). For each prediction, the training set size is 150 and the remaining 70 points form the test set.

FIG. 6. (Color online) Optimized molecules, constructed from two predicted fingerprints A and B, shown with the predicted and calculated values of $E_{\rm HL}$ and $\alpha$. Carbon, oxygen, and hydrogen atoms are given in dark brown, red, and pink.

FIG. 7. (Color online) $E_{\rm HL}$-$\alpha$ log-log plot of the molecules dataset, shown by forest-green symbols, while the predicted fingerprints are shown by red diamonds within the regime of desired properties, i.e., $0.6 < \alpha < 0.7$ Å$^3$/atom and $E_{\rm HL} > 7.0$ eV. In the inset, the predicted and calculated properties of the molecules reconstructed from three predicted fingerprints, i.e., C, D, and E, are shown by closed and open symbols: triangles for C, circles for D, and squares for E. The dashed line sketches the limit $\alpha \sim 1/E_{\rm HL}$ addressed in the text.


C. Remarks 

It is worth noting that the key feature of these fingerprints that makes them usable for the described enumeration and inversion design procedures is their discontinuity with respect to slight configurational perturbations. Because all the possible chemical bonds appearing in a molecule comprising C, O, and H are well defined, it is very likely that the optimization step performed on the reconstructed molecules preserves the predicted fingerprint. Moreover, the efficiency of the design approaches depends on several factors, including the prediction accuracy of the fingerprints used. Although predictions using high-order fingerprints are systematically better, the complexity generated by their high dimensionality is significant. Compared with the procedure described above, one utilizing $\mathbf{f}^{(2)}$ or $\mathbf{f}^{(3)}$ needs roughly 10 and 100 more constraints, respectively, for ensuring that the considered fingerprints are meaningful. If the dimensionality of a higher-order fingerprint can be reduced considerably, it may then be used for the inversion approach.

VII. CONCLUSIONS 

To summarize, we have systematically studied a family of motif-based topological fingerprints which can numerically represent major classes of molecules and crystals. By using a similarity-based learning algorithm, these fingerprints can be mapped onto various properties of molecules and crystals, significantly accelerating the prediction of their properties. A major advantage of these fingerprints is clearly demonstrated via two procedures for designing molecules, one by enumeration and the other by inversion. These procedures rely on the accelerated property predictions to identify the desired fingerprints, and then to reconstruct molecules that possess one or more targeted properties. We note that although only molecules and crystals comprising C, O, and H are considered in this contribution, our results can straightforwardly be generalized to those containing other light elements whose coordination preferences are well established, e.g., N and F.

TABLE I. Predicted and calculated values of $\alpha$ (in Å$^3$/atom) and $E_{\rm HL}$ (in eV) of the molecules designed from three predicted fingerprints C, D, and E. Data from this table are also shown in the inset of Fig. 7.

                        Predicted                    Calculated
Label   $N_{\rm at}$    $\alpha$   $E_{\rm HL}$      $\alpha$         $E_{\rm HL}$
C       11              0.689      7.273             0.654            7.964
D       18              0.670      7.363             0.664 - 0.699    6.502 - 7.348
E       14              0.607      8.612             0.597            8.909

Appendix A: Constraint of $\mathbf{f}^{(1)}$ derived from elementary chemical rules

Constraint (3) was derived with the assumption that the desired molecular structure is connected, i.e., any pair of atoms is connected by at least one sequence of the allowed chemical bonds. Let us take a molecule in which $n_{A_i}$ is the number of the blocks $A_i$. Starting from the applicable chemical rules, all the two-fold coordinated carbon atoms are grouped in pairs, forming $n_{\rm C2}/2$ units of C≡C, each of which is a pair of carbon atoms linked by a triple bond. Next, $n_{\rm O1}$ one-fold coordinated oxygen atoms must bond with $n_{\rm O1}$ three-fold coordinated carbon atoms to form $n_{\rm O1}$ units of C=O. Then, the remaining $n_{\rm C3} - n_{\rm O1}$ three-fold coordinated carbon atoms are grouped together in pairs, forming $(n_{\rm C3} - n_{\rm O1})/2$ units of C=C. Therefore, the set of the blocks $A_i$ now contains $n_{\rm C2}/2 + n_{\rm O1} + (n_{\rm C3} - n_{\rm O1})/2 + n_{\rm C4} + n_{\rm O2}$ units of C≡C, C=O, C=C, C4, and O2. Assuming that these units are isolated, the total number of dangling bonds starting from them is $2(n_{\rm C2}/2) + 2n_{\rm O1} + 4[(n_{\rm C3} - n_{\rm O1})/2] + 4n_{\rm C4} + 2n_{\rm O2}$, or simply

$$ n_{\rm C2} + 2n_{\rm C3} + 4n_{\rm C4} + 2n_{\rm O2}. \qquad (A1) $$

By joining the $n_{\rm C2}/2 + n_{\rm O1} + (n_{\rm C3} - n_{\rm O1})/2 + n_{\rm C4} + n_{\rm O2}$ units together, the number of dangling bonds that will be annihilated to form inter-unit bonds is $2[n_{\rm C2}/2 + n_{\rm O1} + (n_{\rm C3} - n_{\rm O1})/2 + n_{\rm C4} + n_{\rm O2} - 1] + 2N_Q$, where $N_Q$ is the number of loops of bonds, each of which costs 2 extra dangling bonds. Therefore, the number of remaining dangling bonds is

$$ n_{\rm C3} + 2n_{\rm C4} - n_{\rm O1} - 2N_Q + 2. \qquad (A2) $$

All of these dangling bonds must be saturated by $n_{\rm H1}$ hydrogen atoms, thus

$$ n_{\rm H1} = n_{\rm C3} + 2n_{\rm C4} - n_{\rm O1} - 2N_Q + 2. \qquad (A3) $$

The constraint (3) can then be obtained when we divide Eq. (A3) by $N_{\rm at}$. This constraint is applicable not only for molecules but also for crystals formed by repeatedly placing an isolated molecule on a periodic grid. If these molecules are not isolated, i.e., they form a network of $d$ dimensions, $2d$ dangling bonds are used to form the network (assuming that the network is formed only by single bonds). Thus, Eq. (A3) becomes

$$ n_{\rm H1} = n_{\rm C3} + 2n_{\rm C4} - n_{\rm O1} - 2N_Q - 2d + 2. \qquad (A4) $$

In the general case when not only single bonds are involved in the network formation, the parameter $d$ used in Eq. (A4) is not necessarily an integer.
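As a quick check of the counting argument, consider acetic acid, CH3COOH, chosen here purely as an illustration (it is not singled out in the original work). It is an isolated molecule with no closed loops of bonds, so the loop count and $d$ are both zero, and the dangling-bond bookkeeping gives the expected four hydrogen atoms:

```latex
\begin{align*}
&n_{\rm C2} = 0,\quad n_{\rm C3} = 1,\quad n_{\rm C4} = 1,\quad
  n_{\rm O1} = 1,\quad n_{\rm O2} = 1,\quad N_Q = 0,\quad d = 0 \\
&\text{units: } \tfrac{n_{\rm C2}}{2} + n_{\rm O1} + \tfrac{n_{\rm C3} - n_{\rm O1}}{2}
  + n_{\rm C4} + n_{\rm O2} = 0 + 1 + 0 + 1 + 1 = 3 \\
&\text{dangling bonds of the isolated units: }
  n_{\rm C2} + 2n_{\rm C3} + 4n_{\rm C4} + 2n_{\rm O2} = 0 + 2 + 4 + 2 = 8 \\
&\text{dangling bonds used to join the units: } 2(3 - 1) + 2N_Q = 4 \\
&n_{\rm H1} = 8 - 4 = n_{\rm C3} + 2n_{\rm C4} - n_{\rm O1} - 2N_Q + 2
  = 1 + 2 - 1 - 0 + 2 = 4
\end{align*}
```

Dividing the last line by the eight atoms of the molecule gives 4/8 - 2/8 - 1/8 + 1/8 = 2/8 on the left-hand side of the fingerprint constraint, which equals 2(1 - 0 - 0)/8 on the right-hand side.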


Appendix B: Derivation of the recursion relations of $\mathbf{f}^{(2)}$ and $\mathbf{f}^{(3)}$

1. Recursion relations of $\mathbf{f}^{(2)}$

The number $n_{A_i}$ of blocks $A_i$ can be determined by counting all the bonds of $A_i$-$B_j$ type. When the numbers of $A_i$-$B_j$ bonds are summed, the $A_i$-$A_i$ bonds are counted twice. Therefore,

$$ n_{A_i} = \frac{1}{i} \sum_{B_j} (1 + \delta_{A_iB_j})\, n_{A_i-B_j}. \qquad (B1) $$

Then, the recursion relation of $\mathbf{f}^{(2)}$ can be obtained by dividing (B1) by the total number of atoms $N_{\rm at}$.

2. Recursion relations of $\mathbf{f}^{(3)}$

Similar to the derivation of (B1), the fingerprint component $f^{(2)}_{A_i-B_j}$ can be determined by counting the number of $A_i$-$B_j$-$C_k$ sequences before dividing by $j - 1$. In such a procedure, the $A_i$-$B_j$-$A_i$ sequences are counted twice. Thus, for $A_i \neq B_j$, we obtain

$$ n_{A_i-B_j} = \frac{1}{j-1} \left[ \sum_{C_k} n_{A_i-B_j-C_k} + n_{A_i-B_j-A_i} \right]. \qquad (B2) $$

We note that one can also count the number of $B_j$-$A_i$-$C_k$ sequences before dividing the total number by $i - 1$. Thus,

$$ n_{A_i-B_j} = \frac{1}{i-1} \left[ \sum_{C_k} n_{B_j-A_i-C_k} + n_{B_j-A_i-B_j} \right]. \qquad (B3) $$

By dividing (B2) and (B3) by $N_{\rm at}$, two equivalent recursion relations are obtained. Moreover, we note that (B2) and (B3) set up a constraint that $\mathbf{f}^{(3)}$ must also satisfy.